A Bayesian Approach to Modeling Home Run Production in Major League Baseball
In this research project, Dr. Parson and I sought to predict the home run (HR) production of Major League Baseball hitters.
If you work for an MLB team, predicting HRs is important because the home run is the pinnacle outcome of an at-bat (AB).
And even if you don’t work for an MLB team, you could make a lot of money in Vegas if you can accurately predict HRs!
A problem with predicting player HRs is that there is often limited data available.
Generalized linear models (GLMs) from classical statistics have a hard time fitting accurate models given limited data points, such as the six-data-point scenario mentioned above.
This is where a Bayesian model shines: Bayesian models can “learn” from the data, improving the effective number of observations available for fitting.
Pros: Bayesian model
Cons:
Uses multilevel modeling only sparingly
Priors are uninformative and don’t fit the data
They imply an average player’s HR probability of 0.00015
The priors effectively suggest that players could have a HR probability anywhere between 0 and 1
Strange choices of parameters
\[ \begin{align} HR_{nip} &\sim Binomial(AB_{nip},\pi_{nip})\\ \log\frac{\pi_{nip}}{1-\pi_{nip}} &= \alpha_{n}+\beta_n\cdot (Age_{ni}-30)+\eta_n\cdot (Age_{ni}-30)^2\\ &+\delta_p+\xi_i \end{align} \]
The number of HRs hit by player \(n\) in year \(i\) at park \(p\) is binomially distributed according to player \(n\)’s ABs and HR probability \(\pi\) for year \(i\) at park \(p\).
The number of ABs in a year is taken as given for our prediction of HRs, but the probability \(\pi\) that an AB results in a HR varies according to a number of factors.
We can both simulate and predict the number of HRs of a player. For example, let’s examine a player that has 100 ABs and a HR probability \(\pi\) of 0.03.
That is, \(HR \sim Binomial(100,0.03)\).
Simulation:
Prediction:
\(E(HR)=AB\cdot \pi=100\cdot0.03=3\)
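The simulation and prediction steps above can be sketched in code (a minimal illustration using only the Python standard library; the function name `simulate_hr` is ours):

```python
import random

def simulate_hr(ab, pi, seed=None):
    """Simulate one season's HR total as a Binomial(ab, pi) draw."""
    rng = random.Random(seed)
    # Each at-bat is an independent Bernoulli(pi) trial for a home run.
    return sum(rng.random() < pi for _ in range(ab))

ab, pi = 100, 0.03

# Prediction: the expected HR count is AB * pi = 3.
expected = ab * pi

# Simulation: individual seasons scatter around that expectation.
draws = [simulate_hr(ab, pi, seed=s) for s in range(5)]
print(expected, draws)
```

Averaging many simulated seasons recovers the predicted value of 3 HRs.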
Our data for the model come from the Lahman database. Lahman covers a variety of information on each player, including but not limited to batting, fielding, team, and player statistics.
In order to be considered in our analysis, a player must have accumulated at least 6 seasons of play, with at least 50 ABs in each, and have played between 1973 and 2019.
\[ \begin{align} \alpha_n&\sim Normal(\mu_0, \sigma_0), n\in\{1,...,657\}\\ \\ \mu_0&\sim Normal(-3.5,0.1)\\ \sigma_0&\sim Exponential(1) \end{align} \]
Intercept term which represents “innate” hitting ability of players skilled enough to play in MLB
-3.5 on the logit scale corresponds to a probability of ≈0.029, or 2.9%
-3.5 ± 0.1 gives logits [-3.6, -3.4], i.e., probabilities [0.0266, 0.0323] = [2.7%, 3.2%]
-3.5 ± 2 gives logits [-5.5, -1.5], i.e., probabilities [0.0041, 0.182] = [0.4%, 18.2%]
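These conversions can be checked with the inverse-logit function (a quick sketch for verification; this is not part of the original model code):

```python
import math

def inv_logit(x):
    """Map a value on the logit scale back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Prior mean: a logit of -3.5 is roughly a 2.9% HR probability.
print(round(inv_logit(-3.5), 4))   # 0.0293
# One prior sd out: [-3.6, -3.4] -> [0.0266, 0.0323]
print([round(inv_logit(x), 4) for x in (-3.6, -3.4)])
# A vague +/-2 band: [-5.5, -1.5] -> [0.0041, 0.1824]
print([round(inv_logit(x), 4) for x in (-5.5, -1.5)])
```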
No pooling would mean that each player gets their own posterior \(\alpha\) distribution. That is,
\[ \alpha_n\sim Normal(-3.5,1). \]
This prior would be combined with the likelihood of the data to generate the posterior for each player. However, this disadvantages players with fewer seasons because there is not as much data to go on. And we know that we could look at a player \(m\) who performs similarly to \(n\) (WLOG) and generate more accurate predictions.
In a total pooling scenario, \(\alpha\) would just be a dummy variable in the model,
\[ HR=\tau+\Gamma\cdot AB+\alpha_n+\beta_n+\eta_n+\delta_p+\xi_i+\epsilon \]
But then we run into the problem of multicollinearity. That is, how do we tease out the relationships among \(\alpha_n\), \(\delta_p\), and \(\xi_i\), since they are all intercept terms? This model also has a hard time allowing players to stray from the mean for any parameter.
In essence, we lose information from both no and total pooling.
Partial pooling avoids both problems: each player’s intercept is drawn from a shared distribution whose parameters are learned from all players.
\[ \begin{align} \alpha_n&\sim Normal(\mu_0, \sigma_0), n\in\{1,...,657\}\\ \\ \mu_0&\sim Normal(-3.5,0.1)\\ \sigma_0&\sim Exponential(1) \end{align} \]
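These partial pooling priors can be explored with a quick prior-draw sketch (plain Python for illustration; the sampling code is ours, with hyperparameter values taken from the priors above):

```python
import math
import random

rng = random.Random(0)

# Hyperpriors from the model: mu_0 ~ Normal(-3.5, 0.1), sigma_0 ~ Exponential(1).
mu0 = rng.gauss(-3.5, 0.1)
sigma0 = rng.expovariate(1.0)

# All 657 player intercepts share mu0 and sigma0, so a player with few
# seasons borrows strength from the rest of the league.
alphas = [rng.gauss(mu0, sigma0) for _ in range(657)]

# A few draws on the probability scale, for intuition.
probs = [1.0 / (1.0 + math.exp(-a)) for a in alphas[:5]]
print(round(mu0, 3), round(sigma0, 3), [round(p, 4) for p in probs])
```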
\[ \begin{align} \beta_n &\sim Normal(\mu_1,\sigma_1), n\in\{1,...,657\}\\ \\ \mu_1 &\sim Normal(0,0.1)\\ \sigma_1 &\sim Exponential(10) \end{align} \]
Multiplicative effect representing how deviation from centered age (Age - 30) affects HR hitting ability
Age plays a factor in hitting HRs, but the effect is likely not very large, so we set the priors near 0 to reflect this
\[ \begin{align} \eta_n &\sim Normal(\mu_2, \sigma_2),n\in\{1,...,657\}\\ \\ \mu_2 &\sim Normal(0, 0.01)\\ \sigma_2 &\sim Exponential(100)\\ \end{align} \]
Multiplicative effect representing how deviation from centered age squared [(Age - 30)²] affects HR hitting ability
Used to capture the non-linearity of the data without the risk of overfitting
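One consequence worth noting (our addition, not stated in the original): because the age terms form a parabola on the logit scale, each player has an implied peak age. Setting the derivative of the age terms to zero gives
\[ \beta_n + 2\eta_n(Age_{ni}-30) = 0 \quad\Rightarrow\quad Age^{*} = 30 - \frac{\beta_n}{2\eta_n}. \]
For \(\eta_n < 0\) this is a maximum; for example, \(\beta_n = 0.02\) and \(\eta_n = -0.005\) imply a peak at age 32, matching the usual rise-and-decline shape of a hitter’s career.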
\[ \begin{align} \delta_p &\sim Normal(\mu_5,\sigma_5),p\in\{1,...,88\}\\ \\ \mu_5 &\sim Normal(0,0.01)\\ \sigma_5 &\sim Exponential(10) \end{align} \]
Intercept term which captures the effect playing in different parks has on HR probabilities
Parks differ in both dimensions and altitude, both of which affect HR rates
\[ \begin{align} \xi_i &\sim Normal(\mu_6,\sigma_6),i\in\{1,...,47\}\\ \\ \mu_6 &\sim Normal(0,0.25)\\ \sigma_6 &\sim Exponential(10) \end{align} \]
Intercept term which captures the effect playing in different years has on HR probability
Changes can occur because of rules, ownership goals, player goals, etc.
This term captures those changes without asking why there are changes
\[ \begin{align} HR_{nip} &\sim Binomial(AB_{nip},\pi_{nip})\\ \log\frac{\pi_{nip}}{1-\pi_{nip}} &= \alpha_{n}+\beta_n\cdot (Age_{ni}-30)+\eta_n\cdot (Age_{ni}-30)^2\\ &+\delta_p+\xi_i \end{align} \]
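As a sanity check on the priors, one can draw a single player-season from an approximate prior predictive, collapsing each scale hyperprior to its prior mean (a simplification we add for intuition; this is not the original model code):

```python
import math
import random

rng = random.Random(1)

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

# One draw per term, with each scale hyperprior collapsed to its prior mean
# (an Exponential(r) has mean 1/r). This flattens the hierarchy for intuition.
alpha = rng.gauss(-3.5, 1.0)   # player ability; sigma_0 ~ Exp(1), mean 1
beta  = rng.gauss(0.0, 0.1)    # linear age effect; sigma_1 ~ Exp(10), mean 0.1
eta   = rng.gauss(0.0, 0.01)   # quadratic age effect; sigma_2 ~ Exp(100), mean 0.01
delta = rng.gauss(0.0, 0.1)    # park effect; sigma_5 ~ Exp(10), mean 0.1
xi    = rng.gauss(0.0, 0.1)    # year effect; sigma_6 ~ Exp(10), mean 0.1

age, ab = 27, 500
logit_pi = alpha + beta * (age - 30) + eta * (age - 30) ** 2 + delta + xi
pi = inv_logit(logit_pi)
print(round(pi, 4), round(ab * pi, 1))  # simulated HR probability and expected HRs
```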
There are 2,116 parameters for this model
This model is classically non-identifiable because it has three separate intercept terms!
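The non-identifiability is easy to demonstrate: shifting a constant from one intercept term to another leaves the linear predictor, and therefore the likelihood, unchanged, so the data alone cannot separate the intercepts; the hierarchical priors are what make the decomposition estimable. A small numeric check (values arbitrary):

```python
import math

# Shifting a constant c from one intercept to another leaves the linear
# predictor unchanged, so the likelihood cannot separate the three intercepts.
alpha, delta, xi = -3.5, 0.2, -0.1   # arbitrary example values
c = 1.7                              # arbitrary shift

lp_original = alpha + delta + xi
lp_shifted = (alpha + c) + (delta - c) + xi
print(math.isclose(lp_original, lp_shifted))  # True
```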
The model predicts that the average player’s \(\pi\) is about 0.03 (on the probability scale), which is what we observe in the data
Uses Bayesian techniques to update estimates based on what the data say, allowing for inference
Will our model make the Hood math department excellent gamblers?
Areas of future research
Player archetypes and physical characteristics
Considering more data over a longer time interval (like Fellingham and Fisher (2017))
Better data (advanced metrics or play-by-play data)